Due Friday April 8, 5:00PM Central.
Initiate your repo: https://classroom.github.com/a/-PgTc6cw Submit by pushing your code to your repo on Github Classroom. This submission is my work alone and complies with the 30535 integrity policy. Add your initials to indicate your agreement: KEZG Add names of anyone you discussed this problem set with: NO Late coins used this pset: 0. Late coins left after submission: 5. We use (*) to indicate a problem that we think might be time consuming. Notes on submission (10 pts)
1. Git is software for distributed version control. List 4 benefits of distributed version control.
Source: https://about.gitlab.com/topics/version-control/benefits-distributed-version-control-system/
2. For the next questions, we will reference “Learn git concepts, not commands”. Read sections from “Overview” through “Branching”. It is written with git commandline in mind, but github desktop has all these features as well. Focus on the concepts (ie understand the pictures). Save some changes to your homework and make sure it’s being tracked by git with github deskop.
Remote Github repository https://github.com/datasci-harris/skills-problem-set-2-khristel26.git
In desktop: Users/khristelzavaleta/Documents/GitHub/skills-problem-set-2-khristel26
ii. How do you add a file to staging in github desktop? (This is subtle, because it happens automatically).
On GitHub.com, we can navigate to the main page of the repository. Then, above the list of files, we can click “Add file” and then click “Upload files”.
We can add a file to the folder that contains our repository on our computer (when the repository is cloned on our computer), and then commit the file and push it to GitHub.
iii. How do you commit an issue to the local repository? (This is not subtle).
We have to go to the Github application on our computer, and then click on the “commit to main” option, previously we have the option to give the commit a summary and a description.
iv. How does github desktop decide what part of your code to show in the main part of the window?
Github desktop shows in the main part of the screen, the changes made to the file. The lines of code appear in red or green to indicate deletes or additions of code.
v. What branch are you on right now? Why?
I am now on the main branch, this branch is assigned by default at the start.
vi. If you were to click on “current branch”, type a name and click the “New Branch” button, you would create a new branch.
a. What would happen to the files in your working directory?
Now in “current branch” there are two branches, and therefore a file can be edited separately in this second branch without this implying a change in the other branch (a change in “New Branch” would not be reflected in “Main”).
b. What would happen in the remote repo?
Two branches and two versions of the code appear in the remote GitHub repository.
c. Why would you want to work on a different branch?
When it is important to preserve the original code into a chain of changes, for reference purposes. It may be that the main branch is a model version of what you are looking for, and will serve to guide all future codes and changes
If you created a new branch in the previous step, nice! Experimentation develops your skills and understanding.Now, make sure you are on master as you continue your homework.
1. Why doesn’t this code work?
my_variable <- 10 my_variab1e
This formula is not working because the name of the variable created is “my_variable” and not “my_variab1e” (a 1 instead of an “l”), it is necessary to call the variable by the same name assigned to have the result ‘10’.
2. Tweak each of the following R commands so that they run correctly:
library(tidilyverse)
ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
Modifying:
library(tidyverse)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
filter(diamonds, carat > 3)
3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?
The RStudio keyboard shortcuts appears on screen. We can get the same results going to menu at the top: Tools > Keyboard Shortcuts Help.
1. Using the flights data from the nycflights13 package. Find all flights that:
library(nycflights13)
view(flights)
?flights
i. Had an arrival delay of three or more hours
filter(flights,arr_delay/60>=3)
ii. Flew to Houston (IAH or HOU)
filter(flights,dest == "IAH" | dest == "HOU")
iii. Were operated by United, American, or Southwest
view(airlines)
filter(flights,carrier == "UA" | carrier == "AA" | carrier == "WN")
iv. Departed in spring (March, April, May and June)
filter(flights,month == 3 | month == 4 | month == 5 | month == 6)
v. Arrived more than two hours late, but didn’t leave late
filter(flights,arr_delay/60>2 & dep_delay <= 0)
vi. Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights,dep_delay/60>=1 & dep_delay - 30 > arr_delay)
vii. Departed between midnight and 5am (inclusive)
filter(flights, dep_time>0 & dep_time<=500)
2. We reproduced a graph from the mpg data. There is one green triangle point that has very low gas mileage. What car is it?
view(mpg)
filter(mpg, drv == "f" & displ<4 & cty<15)
1. Common bugs: You want to see missing values in a dataframe. You run the following code and get no results. Why is wrong?
filter(flights, arr_time == NA)
It’s wrong because, to check for missing values we have to use the code “is.na” and not filter.
2. How many flights have a missing dep_time? Among these flights, what other variables are missing? What might these rows represent?
These rows can represent flights that never departed, or in other words, that were scheduled but later cancelled.
There are 8,255 flights that have a missing departure time.
sum(is.na(flights$dep_time))
## [1] 8255
filter(flights, is.na(dep_time))
x <-filter(flights, is.na(dep_time))
sapply(x, function(x) sum(is.na(x)))
## year month day dep_time sched_dep_time
## 0 0 0 8255 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 8255 8255 0 8255 0
## flight tailnum origin dest air_time
## 0 2512 0 0 8255
## distance hour minute time_hour
## 0 0 0 0
Source: https://discuss.analyticsvidhya.com/t/how-to-count-the-missing-value-in-r/2949/13
3. What does NA | TRUE evaluate to? Why?
NA | TRUE
## [1] TRUE
Evaluate for true, giving that the logical connector is “|” “or” and because at least one side it is TRUE
4. What does FALSE & NA evaluate to? Why?
FALSE & NA
## [1] FALSE
Evaluate for false, because NA represents something that is ambiguous or unknown (it could be true or false), and given that part of it is already FALSE, the result will be FALSE
1. What happens if you include the name of a variable multiple times in a select() call?
select(flights, day, month, day, year, day, year, dep_time)
If someone includes the name of variables multiple times, select just show it once, in the position where it first was included.
2. Typically, R is case-sensitive, but select helpers ignore case by default. Change the default to return an empty tibble.
select(flights, contains(“TIME”))
select(flights, contains("TIME", ignore.case=FALSE))
Source: https://stackoverflow.com/questions/8361589/turning-off-case-sensitivity-in-r
3. Brainstorm as many distinct ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
Four ways:
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, contains("_time") & -(contains("sched")) & -(contains("air")) | contains("delay"))
var <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights,var)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(var)` instead of `var` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
1. Sort flights to find the most delayed flights by arrival time. End your code with %>% head(1) to only print the most delayed row.
arrange(flights, desc(arr_delay)) %>% head(1)
2. Find the top 5 flights that left earliest relative to their scheduled departure. Only show the tail number, date and number of minutes early they departed at.
arrange(flights, dep_delay) %>% head(5) %>% select(tailnum,year, month, day, dep_delay)
3. Using the flight data, how could you use arrange() to sort all missing values in a certain variable first, before the rest of the data? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_time)))
**1.Currently dep_time and arr_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to number of minutes since midnight in new vars called _time_min, where here is variable and could be dep or arr**
flights <- mutate(flights, dep_time_min = (flights$dep_time %/%100)*60 + flights$dep_time %%100)
flights <- mutate(flights, arr_time_min = (flights$arr_time %/%100)*60 + flights$arr_time %%100)
flights %>%
select(tailnum, dep_time, arr_time, dep_time_min, arr_time_min)
2. Make a plot to illustrate the distribution of the difference between air_time with arr_time_min - dep_time_min. What do you notice about the distribution?
flights <- mutate(flights, diff= flights$arr_time_min - flights$dep_time_min)
view(flights)
ggplot(data = flights) + geom_point(mapping = aes(x=air_time, y= diff))
We would expect the value of air_time and the difference of arr_time_min - dep_time_min to be the same; however, from the distribution we can notice that the values are very far from each other in addition to being scattered. Therefore, we can conclude that there is a problem in how we are analyzing the data, since the points are very different from each other.
3. * We expect air_time and arr_time_min - dep_time_min to be the same, but in many cases they are not.
i. What fraction of flights have different values for these two columns?
sum(flights$air_time == flights$diff, na.rm = TRUE)
## [1] 196
sum(flights$air_time == flights$diff, na.rm = TRUE) / sum(!is.na(flights$diff))
## [1] 0.0005974462
ii. Why not? Identify an egregious data issue which causes an error of several hours.
I would expect the numbers to be the same, but the numbers don’t match because we aren’t accounting for flights where the departure time is close to midnight and the arrival time is the next day. In minutes, our max is 1440, after that the minutes start from zero again.
iii. Identify an additional issue that causes smaller errors
The difference may be due to the different time zones between the city of origin and the city of destination.
iv. Bonus: identify a third issue that causes smaller errors
It could be that the air time is not taking into account other factors such as the movement of an aircraft on the ground, taxiing, when arriving and departing.
4. If air_time is too far off from arr_time - dep_time then we assume one of the columns contains an error. Implement your fix to the major problem. Remake the plot from two questions prior.
flights <- mutate(flights, diff2 = ifelse(flights$arr_time_min >flights$dep_time_min, flights$arr_time_min - flights$dep_time_min, 1440 - flights$dep_time_min + flights$arr_time_min))
view(flights)
ggplot(data = flights) + geom_point(mapping = aes(x=air_time, y= diff2))
From the graph, we can see that the points are now much closer to each other. However, we might still have other problems like the ones listed above.
5. After implementing your fix, what fraction of observations still appear to have an error?
sum(flights$air_time == flights$diff2, na.rm = TRUE)
## [1] 201
sum(flights$air_time == flights$diff2, na.rm = TRUE) / sum(!is.na(flights$diff2))
## [1] 0.0006126872
6. Find the 10 least delayed flights using a ranking function. Carefully read the documentation for min_rank(). There may be more than 10 flights that are in the top 10! Why?
flights %>%
mutate(rank_dep_delay = min_rank(dep_delay)) %>%
arrange(rank_dep_delay) %>%
filter(rank_dep_delay <=10)
From the result, we can see that there are more than 10 flights that are in the top 10 this is because the code min_rank assign the ties values the same rank. We can see that there are 4 flights that are tie in the 9 rank and two that are tie in the rank 7.
1. In lecture, we defined a not_cancelled data frame. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt= distance) (without using count()).
not_cancelled <- filter(flights,!is.na(dep_delay),!is.na(arr_delay))
not_cancelled %>% group_by(dest) %>% summarise(n())
not_cancelled %>% group_by(tailnum) %>% summarise(sum(distance))
2. Calculate average delays by destination for flights originating in NYC. Create a variable which ranks destinations from worst to best, where 1 is the worst rank.
flights %>%
filter(origin == "LGA" | origin == "JFK") %>% #Assuming EWR is not in NYC
group_by(dest) %>%
summarise(average_delay = mean(dep_delay, na.rm = TRUE)) %>%
mutate(worstrank = min_rank(desc(average_delay)))%>%
arrange(worstrank) #From worst ranking (1) to the best ranking
3. What hour of the day should you fly if you want to avoid arrival delays as much as possible?
flights %>%
group_by(hour) %>%
summarise(mean_arrdelay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(mean_arrdelay)
To avoid arrival delay as much as possible, I should leave at 7 am. Likewise, it can be seen that the hours between 5 am and 9 am have a negative arrival delay, meaning that those flights had an early arrival.
4. Which plane (tailnum) has the most minutes of delays total?
Taking into account the question reference the departure delay:
flights %>%
group_by(tailnum) %>%
summarise(sum_depdelay = sum(dep_delay, na.rm = TRUE)) %>%
arrange(desc(sum_depdelay)) %>%
head(1)
5. Find all destinations that are flown by at least three carriers. Order the results from most carriers to fewest.
flights %>%
group_by(dest) %>%
summarise(number_carriers = length(unique(carrier))) %>%
arrange(desc(number_carriers)) %>%
filter(number_carriers>=3)
flights %>%
group_by(dest) %>%
mutate(number_carriers = length(unique(carrier))) %>%
filter(number_carriers>=3) %>%
group_by(carrier) %>%
summarise(number_dest = length(unique(dest))) %>%
arrange(desc(number_dest))
6. Calculate the number of non-cancelled flights by each carrier. Report the results in table in reverse alphabetical order with the min() and max() distance each airline covers.
flights %>%
group_by(carrier) %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
summarise(non_cancelled = n(), min(distance), max(distance)) %>%
arrange(desc(carrier))
7. Which airlines only offer flights from New York state to one other airport?
flights %>%
filter(origin == "LGA" | origin == "JFK") %>% #Assuming EWR is not in NYC
group_by(carrier) %>%
summarise(number_dest = length(unique(dest))) %>%
filter(number_dest == 1)
Therefore, F9 and HA airlines offer flights from New York state to just one other airport.
**1. *We posit that arrival delays are more important than departure delays from a passanger’s perspective. We are interested in determining what airlines to avoid based on their delay characteristics.**
a. Calculate the median arrival delay by airline.
flights %>%
group_by(carrier) %>%
summarise(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
arrange(desc(median_arr_delay))
b. Median arrival delay may not be the best metric for our needs. Brainstorm at least 2 additional ways to assess the typical delay characteristics of a group of flights.
c. Write the dplyr code which calculates your delay measures separately by airline. Do particular airlines perform poorly on multiple metrics?
flights %>%
group_by(carrier) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_arr_delay))
Base on their arrival delay characteristics, the five airlines that have the highest mean delay time are: F9, FL, EV, YV, OO. Therefore, it is recommended that those five be avoided.
flights %>%
group_by(carrier) %>%
summarise(percentage = sum(arr_delay >30, na.rm=TRUE)/ length(!is.na(arr_delay)) * 100) %>%
arrange(desc(percentage))
Base on their arrival delay characteristics, the five airlines that have the highest percentage of delay by more than 30 min are: F9, EV, YV, FL, OO. Therefore, it is recommended that those five be avoided.
1. * Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
dataframe1 <- flights %>%
arrange(origin, dest, month, day, sched_dep_time) %>%
group_by(origin) %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(lag_depdelay = lag(dep_delay))
dataframe1
Months have been added in arrange, because otherwise the order is only done by day, causing errors with the lag variable
ggplot(data = dataframe1) + geom_point(mapping = aes(x=lag_depdelay, y=dep_delay)) + labs(x = "Previus Departure Delay Flight", y= "Departure Delay Flight")
dataframe1 %>%
group_by(lag_depdelay) %>%
summarise(mean_depdelay = mean(dep_delay)) %>%
ggplot() + geom_point(mapping = aes(x=lag_depdelay, y=mean_depdelay)) + labs(x = "Previus Departure Delay Flight", y= "Mean Departure Delay Flight")
2. Now we will look at delays that occur in the air. We will need a sense of how long a flight is.
flights %>%
group_by(dest) %>%
mutate(median_air_time = median(air_time,na.rm = TRUE)) %>%
mutate(most_delayed = air_time - median_air_time) %>%
select(carrier, flight, dest, air_time, median_air_time, most_delayed) %>%
arrange(desc(most_delayed))
In the table, we can see the from highest to lowest the flights that have been most late with respect to the median of the destination. Also, only the relevant columns have been selected to show.
3. For each plane, count the number of flights before the first delay of greater than 1 hour. (Hints: Construct a Boolean variable for every flight which measures whether it had a delay of greater than 1 hour and then use cumsum).
flights %>%
arrange(month, tailnum, day, dep_delay) %>%
group_by(tailnum) %>%
mutate(boolean_va = cumsum(dep_delay)<60) %>%
filter(boolean_va == 'TRUE') %>%
summarise(numberflights_beforefirstdelay = n())